Author: Adrien Carrel
Classification tasks are tasks where the goal of the model is to predict a categorical value: we assign a class (or multiple classes) to a given set of input values.
Objective: Predict an increase in the vaccination coverage of a country in Latam based on previous data.
Application: Generate early warning systems?
Main imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# change renderer to correctly export the plotly figures in the notebook
import plotly.io as pio
pio.renderers.default = "notebook"
Load daily vaccination data in Latam
vac = pd.read_csv("daily_covid_vaccinations_latam.csv")
vac.head(5)
| location | 2020-12-24 | 2020-12-25 | 2020-12-26 | 2020-12-27 | 2020-12-28 | 2020-12-29 | 2020-12-30 | 2020-12-31 | 2021-01-01 | ... | 2022-05-17 | 2022-05-18 | 2022-05-19 | 2022-05-20 | 2022-05-21 | 2022-05-22 | 2022-05-23 | 2022-05-24 | 2022-05-25 | 2022-05-26 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Antigua and Barbuda | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
| 1 | Argentina | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.04 | 0.09 | 0.1 | 0.1 | ... | 90.07 | 90.07 | 90.08 | 90.08 | 90.09 | 90.09 | 90.09 | 90.10 | 90.10 | 0.0 |
| 2 | Bahamas | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
| 3 | Barbados | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | ... | 56.35 | 56.37 | 56.38 | 0.00 | 56.39 | 0.00 | 56.40 | 56.41 | 56.42 | 0.0 |
| 4 | Belize | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | ... | 0.00 | 0.00 | 0.00 | 58.65 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
5 rows × 520 columns
Media Cloud data in Latam. Keywords: "COVID-19 Vaccination"
vac_media = pd.read_csv("covid_vaccination_attention_latam.csv")
vac_media.head(5)
| date | count_Antigua and Barbuda | total_count_Antigua and Barbuda | ratio_Antigua and Barbuda | count_Argentina | total_count_Argentina | ratio_Argentina | count_Bahamas | total_count_Bahamas | ratio_Bahamas | ... | ratio_Suriname | count_Trinidad and Tobago | total_count_Trinidad and Tobago | ratio_Trinidad and Tobago | count_Uruguay | total_count_Uruguay | ratio_Uruguay | count_Venezuela | total_count_Venezuela | ratio_Venezuela | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2020 | 0 | 3 | 0.0 | 2 | 2699 | 0.000741 | 0 | 12 | 0.0 | ... | 0.0 | 0 | 25 | 0.0 | 0 | 158 | 0.0 | 0 | 464 | 0.0 |
| 1 | 1/2/2020 | 0 | 6 | 0.0 | 4 | 6319 | 0.000633 | 0 | 51 | 0.0 | ... | 0.0 | 0 | 28 | 0.0 | 0 | 314 | 0.0 | 0 | 678 | 0.0 |
| 2 | 1/3/2020 | 0 | 10 | 0.0 | 5 | 7209 | 0.000694 | 0 | 60 | 0.0 | ... | 0.0 | 0 | 29 | 0.0 | 0 | 363 | 0.0 | 0 | 897 | 0.0 |
| 3 | 1/4/2020 | 0 | 7 | 0.0 | 0 | 3805 | 0.000000 | 0 | 39 | 0.0 | ... | 0.0 | 0 | 14 | 0.0 | 0 | 192 | 0.0 | 0 | 659 | 0.0 |
| 4 | 1/5/2020 | 0 | 8 | 0.0 | 1 | 3771 | 0.000265 | 0 | 34 | 0.0 | ... | 0.0 | 0 | 27 | 0.0 | 0 | 203 | 0.0 | 0 | 773 | 0.0 |
5 rows × 97 columns
Load COVID-19 cases and deaths in Latam
cases = pd.read_csv("daily_covid_cases_latam.csv")
deaths = pd.read_csv("daily_covid_deaths_latam.csv")
Transpose the dataframe
country_names = vac.T.iloc[0].tolist() # list of countries
df = vac.T # transpose dataframe
df.columns = country_names
df = df[1:] # drop the first row that contains the country names
df.index = pd.to_datetime(df.index)
df = df.astype(float) # force float type
# describe our dataset
df.describe()
| Antigua and Barbuda | Argentina | Bahamas | Barbados | Belize | Bolivia | Brazil | Chile | Colombia | Costa Rica | ... | Nicaragua | Panama | Paraguay | Peru | Saint Lucia | Saint Vincent and the Grenadines | Suriname | Trinidad and Tobago | Uruguay | Venezuela | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | ... | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 | 519.000000 |
| mean | 7.631098 | 52.882042 | 2.240116 | 33.958593 | 7.376455 | 29.802620 | 46.839634 | 65.312543 | 20.688227 | 6.754451 | ... | 3.763121 | 6.809152 | 7.189865 | 40.600906 | 7.717264 | 4.802736 | 14.083083 | 24.887360 | 57.606108 | 1.245877 |
| std | 17.552802 | 34.505647 | 8.261362 | 20.815181 | 16.204873 | 22.657337 | 33.225499 | 30.656022 | 29.035664 | 20.816658 | ... | 16.247307 | 20.185087 | 14.041347 | 34.154956 | 11.804160 | 10.412137 | 17.913978 | 22.577068 | 31.828416 | 8.372928 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 16.120000 | 0.000000 | 21.085000 | 0.000000 | 4.750000 | 10.840000 | 42.740000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 3.085000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 32.955000 | 0.000000 |
| 50% | 0.000000 | 63.680000 | 0.000000 | 35.080000 | 0.000000 | 34.310000 | 57.800000 | 75.370000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 34.750000 | 0.000000 | 0.000000 | 0.000000 | 23.890000 | 75.960000 | 0.000000 |
| 75% | 0.000000 | 86.935000 | 0.000000 | 54.050000 | 0.000000 | 53.880000 | 77.670000 | 91.050000 | 41.115000 | 0.000000 | ... | 0.000000 | 0.000000 | 5.080000 | 75.285000 | 16.955000 | 0.000000 | 32.875000 | 49.330000 | 79.295000 | 0.000000 |
| max | 64.740000 | 90.100000 | 41.940000 | 56.420000 | 58.650000 | 60.820000 | 85.840000 | 93.460000 | 82.350000 | 86.350000 | ... | 85.570000 | 79.270000 | 54.010000 | 87.830000 | 32.080000 | 32.810000 | 45.260000 | 53.360000 | 85.800000 | 77.190000 |
8 rows × 32 columns
Datasaurus Dozen: Always plot your data! Here are some examples of sets of points that share the same x and y means, variances, and correlation.

Remark: If you work in a high-dimensional space, use methods like t-SNE to project your data into a lower-dimensional space before plotting it.
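A minimal sketch of that remark, assuming scikit-learn's `TSNE` is available; the 10-dimensional points are synthetic stand-ins for real features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
points = rng.normal(size=(100, 10))  # 100 samples in a 10-dimensional space

# perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=5, init="random",
                 random_state=0).fit_transform(points)
print(embedding.shape)  # one 2D point per original sample, ready to scatter-plot
```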
# plot vaccination data for Brazil
plt.plot(df["Brazil"])
plt.title("Always plot your data!")
plt.xlabel("Time")
plt.ylabel("Cumulated vaccination coverage (%)")
plt.xticks(rotation=45)
plt.show()
=> Apply a cumulative max to your data (coverage is cumulative, so the series should never decrease)
df = df.cummax() # apply cumulative max
plt.plot(df["Brazil"])
plt.title("Always plot your data!")
plt.xlabel("Time")
plt.ylabel("Cumulated vaccination coverage (%)")
plt.xticks(rotation=45)
plt.show()
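The effect of `cummax` can be checked on a toy series (a hypothetical coverage series with a spurious drop back to 0, like the artifacts visible in the raw data):

```python
import pandas as pd

# hypothetical cumulative-coverage series with a spurious drop back to 0
raw = pd.Series([0.0, 10.0, 0.0, 12.0, 11.5])
cleaned = raw.cummax()  # coverage is cumulative, so it should never decrease
print(cleaned.tolist())  # [0.0, 10.0, 10.0, 12.0, 12.0]
```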
Process vaccination data: from raw data to a daily increase/constant indicator (binary data)
df = (df > df.shift(1)).astype(int) # transform raw data into binary data
df.head(5)
| Antigua and Barbuda | Argentina | Bahamas | Barbados | Belize | Bolivia | Brazil | Chile | Colombia | Costa Rica | ... | Nicaragua | Panama | Paraguay | Peru | Saint Lucia | Saint Vincent and the Grenadines | Suriname | Trinidad and Tobago | Uruguay | Venezuela | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-12-24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-12-25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-12-26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-12-27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-12-28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 32 columns
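The `(df > df.shift(1))` transform compares each day with the previous one; a toy illustration (none of these values come from the real data):

```python
import pandas as pd

# toy coverage series: day-over-day comparison with its shifted copy
coverage = pd.Series([0.0, 0.0, 1.2, 1.2, 2.5])
increase = (coverage > coverage.shift(1)).astype(int)
# the first value is 0 because 0.0 > NaN is False
print(increase.tolist())  # [0, 0, 1, 0, 1]
```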
Transpose deaths and cases
# cases
country_names = cases["location"].tolist()
cases = cases.T
cases.columns = country_names
cases = cases[1:]
cases.index = pd.to_datetime(cases.index)
# deaths
country_names = deaths["location"].tolist()
deaths = deaths.T
deaths.columns = country_names
deaths = deaths[1:]
deaths.index = pd.to_datetime(deaths.index)
Plot cases, deaths, vaccination media coverage for Brazil as an example
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 9))
ax1.plot(cases["Brazil"])
ax2.plot(deaths["Brazil"])
ax3.plot(pd.to_datetime(vac_media["date"]), vac_media["ratio_Brazil"])
plt.xlabel("Time")
ax1.tick_params(labelrotation=45)
ax2.tick_params(labelrotation=45)
ax3.tick_params(labelrotation=45)
ax1.set_title("Daily COVID cases in Brazil")
ax2.set_title("Daily COVID deaths in Brazil")
ax3.set_title("Daily COVID-19 Vaccination media coverage in Brazil")
plt.show()
Apply the same processing to cases and deaths (transform each continuous time series into a binary one: increase vs. no increase)
# cases
cases = (cases > cases.shift(1)).astype(int)
# deaths
deaths = (deaths > deaths.shift(1)).astype(int)
Merge datasets
We crop the other datasets to keep only data from 2020-12-24 to 2022-05-26 (our vaccination dataset).
first_date_df = pd.to_datetime(df.index[0]) # first date in vaccination data
last_date_df = pd.to_datetime(df.index[-1]) # last date in vaccination data
country_names = df.columns
# Media
for country in country_names:
    col = vac_media[[f"ratio_{country}"]]
    col.index = pd.to_datetime(vac_media.date)
    col = col[(col.index >= first_date_df) & (col.index <= last_date_df)]
    df[f"ratio_{country}"] = col[f"ratio_{country}"].to_numpy()
# Cases
for country in country_names:
    col = cases[[country]]
    col.index = pd.to_datetime(cases.index)
    col = col[(col.index >= first_date_df) & (col.index <= last_date_df)]
    df[f"cases_{country}"] = col[country].to_numpy()
# Deaths
for country in country_names:
    col = deaths[[country]]
    col.index = pd.to_datetime(deaths.index)
    col = col[(col.index >= first_date_df) & (col.index <= last_date_df)]
    df[f"deaths_{country}"] = col[country].to_numpy()
df.index = pd.to_datetime(df.index) # datetime format to compare dates
df.head(5)
| Antigua and Barbuda | Argentina | Bahamas | Barbados | Belize | Bolivia | Brazil | Chile | Colombia | Costa Rica | ... | deaths_Nicaragua | deaths_Panama | deaths_Paraguay | deaths_Peru | deaths_Saint Lucia | deaths_Saint Vincent and the Grenadines | deaths_Suriname | deaths_Trinidad and Tobago | deaths_Uruguay | deaths_Venezuela | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-12-24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-12-25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2020-12-26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020-12-27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2020-12-28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
5 rows × 128 columns
It doesn't make any sense to predict a variation in the number of vaccinations for a given country if nobody has been vaccinated yet in this country.
=> Extract the date of first vaccination for each country
starting_dates = {}  # start when the first vaccination occurs in each country
for country in country_names:
    starting_dates[country] = pd.to_datetime(df[df[country] == 1].index[0])
Plot autocorrelation of variation in COVID-19 Vaccinations in Colombia
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(x=df["Colombia"], lags=20)
plt.show()
Partial autocorrelation
plot_pacf(x=df["Colombia"], lags=20)
plt.show()
We notice above that the autocorrelation is no longer significant for lags greater than 7.
Let $k\in\mathbb{N}^{*}$ be a lag to extract multiple data in the past at each timestamp ($k=7$ means that we take present data and data from the 6 previous days to make a prediction for the next day).
Let $(V_{t})_{t\in\mathbb{N}}$, $(D_{t})_{t\in\mathbb{N}}$, $(C_{t})_{t\in\mathbb{N}}$, $(M_{t})_{t\in\mathbb{N}}$ be the timeseries of vaccination variations, COVID deaths variations, COVID cases variations, and media coverage of vaccination in the country.
We want to fit a model that can predict, for all $t\geq k-1$, the value of $V_{t+1}$, which represents an increase (1) or no increase (0) in the number of COVID vaccinations between days $t$ and $t+1$.
One can formulate the problem as follows: find $f: \left ( \{0,1\}^{3}\times [0,1]\right )^{k}\rightarrow \{0,1\}$ such that $\forall t\geq k-1$, $V_{t+1}\approx f(V_{t}, D_{t}, C_{t}, M_{t}, \ldots, V_{t-k+1}, D_{t-k+1}, C_{t-k+1}, M_{t-k+1})=\widehat{V_{t+1}}$
The quality of the model will be evaluated using different metrics between our predictions $(\widehat{V_{t}})_{t\in\mathbb{N}}$ and the true values $(V_{t})_{t\in\mathbb{N}}$.
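The lag-and-label construction ($k$ past values as features, $V_{t+1}$ as label) boils down to pandas `shift`; a toy sketch with $k=3$ on made-up values:

```python
import pandas as pd

# toy binary vaccination series V_t
s = pd.Series([1, 0, 1, 1], index=pd.date_range("2021-01-01", periods=4))
lagged = pd.DataFrame({
    "V": s,                # value at day t
    "V_l1": s.shift(1),    # value at day t-1
    "V_l2": s.shift(2),    # value at day t-2
    "label": s.shift(-1),  # value at day t+1, i.e. what we predict
}).dropna()                # rows lost to the shifts
print(len(lagged))  # only one fully-observed row remains
```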
Let's try $k=7$ first (one week of data)
k = 7  # last 7 days
X = []  # resulting dataset
features_name = ["increase_vac", "media_ratio", "increase_cases", "increase_deaths"]  # names of the features
columns_name = features_name[:]  # columns_name = full list of features, lag features included
for j in range(1, k):
    for feat in features_name:
        columns_name.append(f"{feat}_l{j}")
columns_name.append("country_name")
columns_name.append("date")
columns_name.append("label")
# add lag features + country name (for interrogation for bias later)
for country in country_names:
    features = [country, f"ratio_{country}", f"cases_{country}", f"deaths_{country}"]
    sub_df = df[features].copy()
    sub_df = sub_df[sub_df.index >= starting_dates[country]]
    for j in range(1, k):
        for feat in features:
            sub_df[f"{feat}_l{j}"] = sub_df[feat].shift(j)
    sub_df["country_name"] = country  # add country name
    sub_df["date"] = pd.to_datetime(sub_df.index)  # add date (to sort later)
    sub_df["label"] = sub_df[country].shift(-1)  # add label (shift in the other direction)
    sub_df.dropna(inplace=True)  # drop NaN values due to the shift(s)
    X.extend(list(sub_df.to_numpy()))
X = pd.DataFrame(X, columns=columns_name)
X.sort_values(by="date", inplace=True)  # sort by date! (to avoid time-related bias when evaluating on the test set)
y = X["label"] # y is the label
X.drop(columns=["label", "date"], inplace=True) # drop label and date
X.head(5)
| increase_vac | media_ratio | increase_cases | increase_deaths | increase_vac_l1 | media_ratio_l1 | increase_cases_l1 | increase_deaths_l1 | increase_vac_l2 | media_ratio_l2 | ... | increase_deaths_l4 | increase_vac_l5 | media_ratio_l5 | increase_cases_l5 | increase_deaths_l5 | increase_vac_l6 | media_ratio_l6 | increase_cases_l6 | increase_deaths_l6 | country_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3231 | 0 | 0.224383 | 0 | 1 | 0.0 | 0.244272 | 1.0 | 0.0 | 0.0 | 0.239538 | ... | 0.0 | 1.0 | 0.258086 | 1.0 | 0.0 | 1.0 | 0.333841 | 0.0 | 0.0 | Chile |
| 3232 | 0 | 0.270036 | 1 | 0 | 0.0 | 0.224383 | 0.0 | 1.0 | 0.0 | 0.244272 | ... | 0.0 | 1.0 | 0.285285 | 0.0 | 0.0 | 1.0 | 0.258086 | 1.0 | 0.0 | Chile |
| 9439 | 0 | 0.309479 | 0 | 0 | 0.0 | 0.346581 | 0.0 | 0.0 | 0.0 | 0.335810 | ... | 1.0 | 0.0 | 0.383148 | 0.0 | 1.0 | 1.0 | 0.358477 | 1.0 | 1.0 | Mexico |
| 3233 | 0 | 0.284797 | 0 | 1 | 0.0 | 0.270036 | 1.0 | 0.0 | 0.0 | 0.224383 | ... | 1.0 | 0.0 | 0.259018 | 1.0 | 0.0 | 1.0 | 0.285285 | 0.0 | 0.0 | Chile |
| 3234 | 0 | 0.241434 | 0 | 0 | 0.0 | 0.284797 | 0.0 | 1.0 | 0.0 | 0.270036 | ... | 0.0 | 0.0 | 0.239538 | 1.0 | 1.0 | 0.0 | 0.259018 | 1.0 | 0.0 | Chile |
5 rows × 29 columns
Size of the dataset
print(f"Size of the dataset (nb_rows, nb_columns): {X.drop(columns=['country_name']).shape}")
Size of the dataset (nb_rows, nb_columns): (14389, 28)
Check missingness in our data (no missing data at all)
print(f"Maximum amount of missing data among all features: {X.isnull().sum().max()}")
Maximum amount of missing data among all features: 0
Split to create a test set (20% of the data, the most recent ones) to evaluate a posteriori the performance of a model.
Also remove the country name variable from X (it will be used later to evaluate biases).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.drop(columns=["country_name"]), y, test_size=0.20, shuffle=False)
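Why `shuffle=False` matters can be seen on a toy ordered dataset (indices standing in for dates): the split is chronological, so the most recent rows form the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# with shuffle=False, train_test_split simply cuts the (date-sorted) data:
# the last 20% of rows, i.e. the most recent dates, become the test set
data = np.arange(10).reshape(-1, 1)
labels = np.arange(10)
tr, te, _, _ = train_test_split(data, labels, test_size=0.2, shuffle=False)
print(tr.ravel().tolist(), te.ravel().tolist())  # [0..7] and [8, 9]
```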
Objective:
Try the following models: Random Forest, Logistic Regression, AdaBoost
Use a Time series cross-validation
Compare AUC and other metrics
Hyperparameter tuning
Performance on unseen data: evaluation of the model on the test set
Understand our model: sensitivity analysis
Interrogation for bias: does the created model perform equally across the different countries in Latam?
Logistic Regression:
Let $x=(x_{1}, \ldots, x_{n})\in\mathbb{R}^{n}$ be our features for a particular element of our dataset.
Find $\theta\in\mathbb{R}^{n}$ such that $p(label|data)=\sigma (\theta^{T}\cdot x)=\frac{1}{1+e^{-\theta^{T}\cdot x}}$.
If $p\geq 0.5$, predict class 1, else 0.
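The decision rule above, sketched with hypothetical coefficients ($\theta$ would normally be learned from the data):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])  # hypothetical coefficients (not fitted here)
x = np.array([2.0, 0.5])       # one feature vector
p = sigmoid(theta @ x)         # p(label | data) = sigma(theta^T x)
prediction = int(p >= 0.5)     # threshold at 0.5
print(round(p, 3), prediction)  # 0.622 1
```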

Decision Tree:
Measure of impurity: Gini index (but others exist: entropy, etc.)
At node $t$, $Gini(t)=1-\sum_{j=1}^{J}(p(j|t))^{2}$ where $J$ is the number of classes (binary classification, $J=2$) and $p(j|t)$ is the relative frequency of class $j$ at node $t$.
When node $t$ is split on attribute $A$ into $k$ partitions (children), the quality of split is defined by: $Gini_{A}=\sum_{i=1}^{k}\frac{n_{i}}{n}Gini(i)$ where $n_{i}$ is the number of records at child $i$ and $n$ is the number of records at node $t$.
Training => until a stopping criterion is met (all records have the same attribute values / belong to the same class), split nodes $t$ by minimizing the Gini index.
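The two formulas can be checked numerically (toy class counts, not the notebook's data):

```python
def gini(counts):
    # Gini(t) = 1 - sum_j p(j|t)^2, from the class counts at node t
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    # Gini_A = sum_i (n_i / n) * Gini(i) over the k children of the split
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

print(gini([6, 4]))                  # node with 6 class-0 and 4 class-1 records
print(gini_split([[5, 1], [1, 3]]))  # quality of splitting it into two children
```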

Random Forest
Ensemble learning combines multiple algorithms to make better predictions by averaging them.
Random Forest is a bagging model: the individual models make their predictions in parallel. This slightly increases the bias but decreases the variance.
Training => randomly select a subset of $m$ features and train a decision tree on a fraction of the training set (bootstrap technique). Repeat the process to obtain $N$ decision trees.

AdaBoost is, on the contrary, a boosting model. This ensemble learning method reduces the bias but increases the variance.
Models are trained in cascade: the errors of model $i$ are used to train model $i+1$. At the end, many weak learners are aggregated.
Dataset $D=((x_{i}, t_{i}))_{i\in\llbracket 1, n\rrbracket}$, start with a vector of equal weights $\forall i\in\llbracket 1, n\rrbracket, w_{i}^{(0)}=\frac{1}{n}$.
For all $t\in\mathbb{N}$, $y_{t}$ is the $t$-th model.
Train $y_{0}$ on the entire dataset with equal weights.
Let $J_{t}=\sum_{i=1}^{n}w_{i}^{(t)}\mathbb{1}(y_{t}(x_{i})\neq t_{i})$ be the cost function of classifier $t$.
=> Training
Calculate error rate of model $t$: $\gamma_{t}=\frac{J_{t}}{\sum_{i=1}^{n}w_{i}^{(t)}}$
Quality coefficient of classifier $t$: $\alpha_{t}=\ln\left ( \frac{1-\gamma_{t}}{\gamma_{t}} \right )$
Update weights: $w_{i}^{(t+1)}=w_{i}^{(t)}e^{\alpha_{t}\mathbb{1}(y_{t}(x_{i})\neq t_{i})}$
=> Idea: insist on the "hard cases", misclassified ones.
Final model:

Source of the image: CentraleSupélec, 2EL1730, ML Lecture 05.
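One weight update can be traced numerically with a toy misclassification pattern (none of this comes from the notebook's data):

```python
import numpy as np

# one AdaBoost round, following the formulas above
n = 5
weights = np.full(n, 1.0 / n)              # w_i^(0) = 1/n
misclassified = np.array([0, 0, 1, 0, 1])  # 1(y_t(x_i) != t_i), toy mask

J = np.sum(weights * misclassified)        # cost J_t of classifier t
gamma = J / np.sum(weights)                # error rate gamma_t
alpha = np.log((1.0 - gamma) / gamma)      # quality coefficient alpha_t
weights = weights * np.exp(alpha * misclassified)  # insist on the hard cases
print(np.round(weights, 2).tolist())  # misclassified points get heavier
```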
Fit the same model multiple times on different splits of a dataset. Idea: make sure that we don't overfit a particular fold and that our model is truly learning patterns.
As we are working with time series, a time-series cross-validation is mandatory to avoid leaking future data into the training folds. Otherwise, the risk of overfitting is huge, and the model is more prone to dataset shift.
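scikit-learn's `TimeSeriesSplit` enforces exactly this constraint; on ten chronologically ordered rows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(10).reshape(-1, 1)  # 10 chronologically ordered rows
splits = list(TimeSeriesSplit(n_splits=3).split(X_toy))
for train_idx, val_idx in splits:
    # each validation fold comes strictly after its training fold
    print(train_idx.tolist(), "->", val_idx.tolist())
```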
Calculate the proportion of each label. If 50% of our labels are 1 and the other half 0, an accuracy of 0.8 is great. However, if 80% of our labels are equal to 1, a "model" constantly predicting 1 will also reach an accuracy of 80%.
print(f"Proportion of class 1 in the training set: {round(100 * y_train.sum() / len(y_train), 2)}")
print(f"Proportion of class 1 in the test set: {round(100 * y_test.sum() / len(y_test), 2)}")
Proportion of class 1 in the training set: 53.33
Proportion of class 1 in the test set: 32.8
We notice a drop in proportion between our train set and our test set.
Let's use the Vulpes 🦊 (zorro) Python package (still in pre-alpha) to train many models, with great flexibility in terms of which preprocessing pipeline to apply, which cross-validation technique to use, which metrics to calculate, etc.
from vulpes.automl import Classifiers
classifiers = Classifiers(preprocessing=None, cv="timeseries")
models_dict = dict(classifiers.models_to_try)
models_to_try = [(m, models_dict[m]) for m in ["RandomForestClassifier", "LogisticRegression", "AdaBoostClassifier"]]
classifiers.models_to_try = models_to_try
classifiers.fit(X_train, y_train)
0%| | 0/3 [00:00<?, ?it/s]
RandomForestClassifier
33%|███▎ | 1/3 [00:17<00:34, 17.22s/it]
LogisticRegression
67%|██████▋ | 2/3 [00:17<00:07, 7.54s/it]
AdaBoostClassifier
100%|██████████| 3/3 [00:20<00:00, 6.86s/it]
| Balanced Accuracy | Accuracy | Precision | Recall | F1 Score | AUROC | AUPRC | Micro avg Precision | Running time | |
|---|---|---|---|---|---|---|---|---|---|
| Model | |||||||||
| RandomForestClassifier | 0.844237 | 0.843379 | 0.842033 | 0.844237 | 0.842232 | 0.907676 | 0.913777 | 0.912353 | 17.209869 |
| LogisticRegression | 0.835917 | 0.837435 | 0.835639 | 0.835917 | 0.835682 | 0.908409 | 0.919295 | 0.919390 | 0.765026 |
| AdaBoostClassifier | 0.834616 | 0.836392 | 0.834762 | 0.834616 | 0.834522 | 0.904284 | 0.913905 | 0.914010 | 2.601188 |
All models perform well.
Remark: we should normally look at the standard deviation of the metrics to see if some folds are over/underperforming.
Let's train on the entire training dataset now.
classifiers.use_cross_validation = False
classifiers.fit(X_train, y_train)
0%| | 0/3 [00:00<?, ?it/s]
RandomForestClassifier
33%|███▎ | 1/3 [00:01<00:03, 1.73s/it]
LogisticRegression
67%|██████▋ | 2/3 [00:02<00:01, 1.32s/it]
AdaBoostClassifier
100%|██████████| 3/3 [00:03<00:00, 1.31s/it]
| Balanced Accuracy | Accuracy | Precision | Recall | F1 Score | AUROC | AUPRC | Micro avg Precision | Running time | |
|---|---|---|---|---|---|---|---|---|---|
| Model | |||||||||
| RandomForestClassifier | 0.829551 | 0.831090 | 0.832847 | 0.829551 | 0.830208 | 0.891767 | 0.889326 | 0.887335 | 1.728408 |
| LogisticRegression | 0.821478 | 0.822406 | 0.822758 | 0.821478 | 0.821855 | 0.894211 | 0.889794 | 0.889930 | 1.034183 |
| AdaBoostClassifier | 0.819542 | 0.820669 | 0.821343 | 0.819542 | 0.820004 | 0.889453 | 0.883121 | 0.883355 | 1.145862 |
# Import
import optuna
from sklearn.model_selection import cross_validate, TimeSeriesSplit
from sklearn.metrics import balanced_accuracy_score, make_scorer
# Extract our Random Forest pipeline (preprocessing + model, here no preprocessing)
pipe = classifiers.get_fitted_models()["RandomForestClassifier"]
# Cross validation and scoring method
cv = TimeSeriesSplit(n_splits=5)
scoring = {"balanced_accuracy": make_scorer(balanced_accuracy_score, greater_is_better=True)}
# Objective function: how we define which model is the best, how we tune hyperparameters
def objective(trial):
    # test values of max_depth and n_estimators
    randomforestclassifier__max_depth = trial.suggest_int("randomforestclassifier__max_depth", 2, 5)
    randomforestclassifier__n_estimators = trial.suggest_int("randomforestclassifier__n_estimators", 50, 500)
    # update model parameters
    model_params = {
        "randomforestclassifier__n_estimators": randomforestclassifier__n_estimators,
        "randomforestclassifier__max_depth": randomforestclassifier__max_depth,
    }
    pipe.set_params(**model_params)
    # cross validation on train set
    cv_model = cross_validate(
        pipe,
        X_train,
        y_train,
        cv=cv,
        return_estimator=True,
        n_jobs=-1,
        fit_params={},
        scoring=scoring,
    )
    # return mean balanced accuracy on the different validation sets
    value = np.nanmean(cv_model["test_balanced_accuracy"])
    return value
# Create our study
n_trials = 20 # number of steps to tune hyperparameters
# Create study
# direction="maximize": we want to maximize balanced accuracy
# pruner (optional): early stopping strategy
study = optuna.create_study(direction="maximize", study_name="Tuning",
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=10))
study.optimize(objective, n_trials=n_trials)
study_result = study.trials_dataframe() # store the results in a dataframe
study_result.head(5)
[I 2022-08-03 01:06:19,328] A new study created in memory with name: Tuning
[I 2022-08-03 01:06:27,839] Trial 0 finished with value: 0.8429293996530143 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 385}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:29,473] Trial 1 finished with value: 0.8382553343965146 and parameters: {'randomforestclassifier__max_depth': 2, 'randomforestclassifier__n_estimators': 81}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:33,872] Trial 2 finished with value: 0.8388932846270759 and parameters: {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 263}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:40,068] Trial 3 finished with value: 0.8400316510969397 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 325}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:42,776] Trial 4 finished with value: 0.8419018290676166 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 109}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:44,899] Trial 5 finished with value: 0.838602455901133 and parameters: {'randomforestclassifier__max_depth': 2, 'randomforestclassifier__n_estimators': 93}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:52,109] Trial 6 finished with value: 0.8370566340750478 and parameters: {'randomforestclassifier__max_depth': 2, 'randomforestclassifier__n_estimators': 455}. Best is trial 0 with value: 0.8429293996530143.
[I 2022-08-03 01:06:55,831] Trial 7 finished with value: 0.8431237045681887 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 157}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:02,615] Trial 8 finished with value: 0.8428656616183436 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 300}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:05,032] Trial 9 finished with value: 0.8418662617126962 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 97}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:09,051] Trial 10 finished with value: 0.8399832020340174 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 203}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:15,742] Trial 11 finished with value: 0.8403376367558286 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 413}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:23,134] Trial 12 finished with value: 0.8429293996530143 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 384}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:26,556] Trial 13 finished with value: 0.8387392505230238 and parameters: {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 195}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:30,350] Trial 14 finished with value: 0.8402376660597497 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 185}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:39,370] Trial 15 finished with value: 0.8429875463091809 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 482}. Best is trial 7 with value: 0.8431237045681887.
[I 2022-08-03 01:07:48,499] Trial 16 finished with value: 0.8435038973716166 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 494}. Best is trial 16 with value: 0.8435038973716166.
[I 2022-08-03 01:07:53,070] Trial 17 finished with value: 0.8388897060785604 and parameters: {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 254}. Best is trial 16 with value: 0.8435038973716166.
[I 2022-08-03 01:07:56,186] Trial 18 finished with value: 0.8397807697268995 and parameters: {'randomforestclassifier__max_depth': 4, 'randomforestclassifier__n_estimators': 159}. Best is trial 16 with value: 0.8435038973716166.
[I 2022-08-03 01:08:03,302] Trial 19 finished with value: 0.8431261360769495 and parameters: {'randomforestclassifier__max_depth': 5, 'randomforestclassifier__n_estimators': 345}. Best is trial 16 with value: 0.8435038973716166.
| number | value | datetime_start | datetime_complete | duration | params_randomforestclassifier__max_depth | params_randomforestclassifier__n_estimators | state | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.842929 | 2022-08-03 01:06:19.336026 | 2022-08-03 01:06:27.839040 | 0 days 00:00:08.503014 | 5 | 385 | COMPLETE |
| 1 | 1 | 0.838255 | 2022-08-03 01:06:27.839040 | 2022-08-03 01:06:29.473370 | 0 days 00:00:01.634330 | 2 | 81 | COMPLETE |
| 2 | 2 | 0.838893 | 2022-08-03 01:06:29.488912 | 2022-08-03 01:06:33.872028 | 0 days 00:00:04.383116 | 3 | 263 | COMPLETE |
| 3 | 3 | 0.840032 | 2022-08-03 01:06:33.881227 | 2022-08-03 01:06:40.067004 | 0 days 00:00:06.185777 | 4 | 325 | COMPLETE |
| 4 | 4 | 0.841902 | 2022-08-03 01:06:40.069000 | 2022-08-03 01:06:42.775051 | 0 days 00:00:02.706051 | 5 | 109 | COMPLETE |
Visualize the hyperparameter tuning
from optuna.visualization import plot_parallel_coordinate
plot_parallel_coordinate(study)
from optuna.visualization import plot_contour
plot_contour(study)
from optuna.visualization import plot_slice
plot_slice(study)
from optuna.visualization import plot_param_importances
plot_param_importances(study)
Once we have an estimate of the best hyperparameters among the ones tested, we can create our "best_rf" (Best Random Forest).
tuned_rf = pipe
tuned_rf.set_params(**study.best_params) # apply best parameters
# fit the tuned model on the entire train set
tuned_rf.fit(X_train, y_train)
tuned_rf
Pipeline(steps=[('randomforestclassifier',
RandomForestClassifier(max_depth=5, n_estimators=494,
n_jobs=-1, random_state=42))])
Plot ROC (Receiver Operating Characteristic) curves and compare their AUC (Area Under the Curve)
from sklearn.metrics import roc_curve, auc
def plot_roc(fpr, tpr, roc_auc, model_name, plot_baseline=True):
    # plot ROC based on false positive rate, true positive rate and display roc_auc
    lw = 2  # line width
    plt.plot(
        fpr,
        tpr,
        lw=lw,
        label=f"AUC ROC {model_name} {round(roc_auc, 2)}",
    )
    if plot_baseline:
        plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--", label="Baseline")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic")
    plt.legend(loc="best")

def plot_roc_multiple(models, X_test, y_test):
    # plot multiple ROC curves for multiple models
    plt.figure()
    # Compute ROC curve and ROC area
    is_first_model = True
    for model_name, model in models.items():
        y_pred = model.predict_proba(X_test)  # probabilities
        fpr, tpr, _ = roc_curve(y_test, y_pred[:, 1])
        roc_auc = auc(fpr, tpr)
        plot_roc(fpr, tpr, roc_auc, model_name, plot_baseline=is_first_model)
        is_first_model = False
    plt.show()
models = classifiers.get_fitted_models()  # dictionary of {model_name: fitted_model}
models["tuned_rf"] = tuned_rf # add tuned model
plot_roc_multiple(models, X_test, y_test)
Evaluate classification metrics for each class
from sklearn.metrics import classification_report
model = tuned_rf # let's evaluate our tuned model on the test set
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
precision recall f1-score support
0.0 0.83 0.88 0.85 1934
1.0 0.72 0.64 0.68 944
accuracy 0.80 2878
macro avg 0.78 0.76 0.77 2878
weighted avg 0.80 0.80 0.80 2878
For Random Forest, we can use MDI (Mean Decrease Impurity across all trees) among many other techniques to evaluate the importance of each feature.
# extract feature importance using MDI with standard deviations
# NB: the random forest is the first step of the pipeline (we didn't apply any preprocessing to our dataset)
# => we access the model with model[0]; it could be model[k] with k >= 1 when preprocessing steps come first
importances = model[0].feature_importances_ # feature importance
feature_names = model[0].feature_names_in_ # feature names
forest_importances = pd.DataFrame()
forest_importances["importances"] = importances
forest_importances.index = feature_names
std = np.std([tree.feature_importances_ for tree in model[0].estimators_], axis=0) # std of feature importance across trees
forest_importances["std"] = std
forest_importances["yerr"] = 1.96 * forest_importances["std"] / np.sqrt(len(model[0].estimators_)) # 95% confidence interval (standard error of the mean across trees)
forest_importances.sort_values(by="importances", ascending=False, inplace=True)
fig, ax = plt.subplots(figsize=(12, 8))
forest_importances["importances"].plot.bar(yerr=forest_importances["yerr"], ax=ax)
ax.set_title("Feature importances using MDI - 95% confidence interval")
ax.set_xlabel("Features")
ax.set_ylabel("Mean Decrease in Impurity")
plt.xticks(rotation=45)
fig.tight_layout()
The lagged vaccination-increase features and the media-coverage features carry most of the importance.
Variations of deaths and cases during the week seem irrelevant as features in our model.
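Note that MDI is computed on the training data and can overstate the importance of features the trees split on often. Permutation importance on held-out data is a common alternative; a minimal sketch on synthetic data (not our feature matrix), where only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on feature 0
rng = np.random.RandomState(42)
X_syn = rng.normal(size=(500, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
# Shuffle each feature on the test set and measure the drop in score
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean)  # the informative feature should dominate
```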
We can also plot one tree (an estimator of the random forest) to visualize the decisions that lead to a prediction and help explain our model.
Plot only the first 3 levels.
from sklearn import tree
fig, axes = plt.subplots(1, 1, figsize=(8, 8), dpi=300)
decision_tree = model[0].estimators_[0]
tree.plot_tree(decision_tree, max_depth=3, feature_names=feature_names,
class_names=["Constant", "Increase"], filled=True,
fontsize=4)
[Decision tree plot: first three levels of the first estimator. The root splits on increase_vac_l4; subsequent splits use the other increase_vac_* lags and the media_ratio_* features, with node Gini impurity decreasing toward the leaves.]
Gini impurity is low at the left and right extremities of the tree.
It seems that an increase in vaccination is likely to happen if it happened in the days before.
Vaccination coverage was more likely to stay constant when it didn't increase in the past few days and when nationwide media coverage of "COVID-19 Vaccination" was low.
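The gini value displayed in each node is the Gini impurity, 1 - sum(p_k^2), where p_k is the proportion of class k in the node. A quick check against the root node's reported values:

```python
def gini(counts):
    # Gini impurity: 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Root node of the plotted tree: value = [5347, 6164], reported gini = 0.497
print(round(gini([5347, 6164]), 3))  # 0.497

# A pure node (one class only) has zero impurity
print(gini([100, 0]))  # 0.0
```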
Using the code below, we can evaluate our model per country.
from sklearn.metrics import balanced_accuracy_score
_, X_test_cty, _, _ = train_test_split(X, y, test_size=0.20, shuffle=False) # same split as before, but keep country_name in X this time
X_test_cty["label"] = y_test
country_list = []
acc_list = []
for country in X_test_cty["country_name"].unique():
sub_df = X_test_cty[X_test_cty["country_name"] == country] # filter attribute values and labels for this country in the test set
y_test_2 = sub_df["label"] # extract labels
X_test_2 = sub_df.drop(columns=["country_name", "label"]) # remove these variables (not predictors)
pred = model.predict(X_test_2) # make predictions
acc = balanced_accuracy_score(y_test_2, pred) # calculate balanced accuracy
country_list.append(country)
acc_list.append(acc)
accuracy_per_country = pd.DataFrame()
accuracy_per_country["Country Name"] = country_list
accuracy_per_country["Balanced Accuracy"] = acc_list
accuracy_per_country.sort_values(by="Balanced Accuracy", ascending=False, inplace=True) # sort the countries based on their accuracies
accuracy_per_country
| Country Name | Balanced Accuracy | |
|---|---|---|
| 24 | Venezuela | 1.000000 |
| 30 | Uruguay | 0.797354 |
| 19 | Trinidad and Tobago | 0.743172 |
| 31 | Ecuador | 0.702963 |
| 23 | Chile | 0.673963 |
| 2 | Colombia | 0.635714 |
| 11 | Mexico | 0.630946 |
| 10 | Cuba | 0.610836 |
| 25 | Argentina | 0.610294 |
| 1 | Brazil | 0.555556 |
| 0 | Guyana | 0.543736 |
| 20 | Saint Vincent and the Grenadines | 0.541667 |
| 18 | Bolivia | 0.538547 |
| 9 | Paraguay | 0.500000 |
| 22 | Jamaica | 0.500000 |
| 3 | Bahamas | 0.500000 |
| 28 | Grenada | 0.500000 |
| 26 | Dominica | 0.500000 |
| 4 | Belize | 0.500000 |
| 5 | Guatemala | 0.500000 |
| 6 | Antigua and Barbuda | 0.500000 |
| 21 | Honduras | 0.500000 |
| 12 | Saint Lucia | 0.500000 |
| 7 | Suriname | 0.500000 |
| 17 | Costa Rica | 0.500000 |
| 8 | Haiti | 0.500000 |
| 15 | Peru | 0.500000 |
| 14 | El Salvador | 0.500000 |
| 16 | Nicaragua | 0.500000 |
| 29 | Panama | 0.493243 |
| 27 | Dominican Republic | 0.488889 |
| 13 | Barbados | 0.486345 |
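Many countries sit at exactly 0.5: balanced accuracy is the unweighted mean of per-class recalls, so whenever the model predicts a single class for every sample of a country, one recall is 1 and the other is 0. A toy illustration:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# A constant prediction gets recall 1.0 on one class and 0.0 on the other,
# so balanced accuracy = (1.0 + 0.0) / 2 = 0.5
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.zeros_like(y_true)  # always predict class 0

score = balanced_accuracy_score(y_true, y_pred)
print(score)  # 0.5
```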
Let's visualize the disparities in balanced accuracy across countries using a choropleth map.
import plotly.express as px
# Load geojson file with coordinates in South America
from urllib.request import urlopen
import json
with urlopen('https://raw.githubusercontent.com/codeforgermany/click_that_hood/main/public/data/south-america.geojson') as response:
geojson = json.load(response)
for feature in geojson["features"]:
feature["id"] = feature["properties"]["name"]
fig = px.choropleth(accuracy_per_country, geojson=geojson, locations="Country Name", color="Balanced Accuracy",
color_continuous_scale="reds",
labels={"Balanced Accuracy": "Balanced Accuracy"}
)
fig.update_geos(fitbounds="locations")
# fig.update_layout(height=300, margin={"r":0,"t":0,"l":0,"b":0})
fig.update_layout(title="Choropleth map: Balanced accuracy of our model on the test set (per country)")
fig.show()
Reformulate the problem/the strategy:
Does it make more sense/is it more useful to predict three classes: 0 - no increase, 1 - "short" increase, 2 - "large" increase in vaccination?
Make one model per country / add a group parameter to stratify the folds by country during training
Add more features:
Create artificial features (product of two features, use exp, log, sqrt functions, etc)
Incorporate Google Trends data, Media Cloud data with other keywords, country specific sociodemographics data, etc
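As a sketch of the "artificial features" idea, using a hypothetical frame that reuses two of this notebook's lag-feature names:

```python
import numpy as np
import pandas as pd

# Hypothetical values for two of the notebook's lag features
df = pd.DataFrame({
    "media_ratio_l1": [0.05, 0.20, 0.40],
    "increase_vac_l1": [0.0, 1.0, 1.0],
})

# Artificial features: an interaction product and simple nonlinear transforms
df["media_x_increase"] = df["media_ratio_l1"] * df["increase_vac_l1"]
df["log_media"] = np.log1p(df["media_ratio_l1"])  # log(1 + x), safe at 0
df["sqrt_media"] = np.sqrt(df["media_ratio_l1"])
print(df.columns.tolist())
```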
More Exploratory Data Analysis (EDA):
Look at the distribution of the features
Correlation matrix
More plot to detect patterns/anomalies
Tip: use tools like pandas-profiling
Preprocessing:
Normalization (on non-categorical data especially)
Feature selection (based on feature importance, correlation, etc)
...
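A minimal sketch of such a preprocessing pipeline on toy data (not our feature matrix), combining normalization and univariate feature selection ahead of the classifier:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Toy data standing in for the notebook's feature matrix
X_toy, y_toy = make_classification(n_samples=200, n_features=10,
                                   n_informative=3, random_state=42)

# Normalize, keep the 5 best features by ANOVA F-score, then classify
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=5),
    RandomForestClassifier(random_state=42),
)
pipe.fit(X_toy, y_toy)
print(pipe[1].get_support().sum())  # number of features kept: 5
```

Wrapping the preprocessing inside the pipeline ensures it is re-fit on each training fold during cross-validation, avoiding leakage from the test folds.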
Models
Explore other models
Further tune the hyperparameters
Search for other sources of biases
Don't forget to evaluate, as much as possible, the impact of your preprocessing pipeline, your additional features, etc. on performance
More importantly
Be creative and have fun! <3 <3
Notebook on https://github.com/AdrienC21/makehealthlatam_introductorytraining_2022